273. Notes on setting up OpenTelemetry on GCP

WHY

I'm starting to work with Google Managed Prometheus, hereafter GMP.
To use GMP, metrics can be collected with Google's own components,
or with OpenTelemetry (hereafter Otel).
Since Otel also covers the tracing I'll have to set up eventually anyway,
I might as well bite the bullet and learn it all in one go.

Note

It basically breaks down into a few steps.

1. Set up Workload Identity: create a service account, grant it monitoring.metricWriter, and bind it to the default ServiceAccount of the namespace.

kubectl create ns otel \
&& 
gcloud config set project <project> \
&&
gcloud iam service-accounts create gmp-test-sa \
&&
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project>.svc.id.goog[otel/default]" \
  gmp-test-sa@<project>.iam.gserviceaccount.com \
&&
kubectl annotate serviceaccount \
  --namespace otel \
  default \
  iam.gke.io/gcp-service-account=gmp-test-sa@<project>.iam.gserviceaccount.com \
&&
gcloud projects add-iam-policy-binding <project> \
  --member=serviceAccount:gmp-test-sa@<project>.iam.gserviceaccount.com \
  --role=roles/monitoring.metricWriter
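
Before moving on, these read-only commands (just a sanity check on my part) should show the workloadIdentityUser binding on the Google service account and the iam.gke.io/gcp-service-account annotation on the Kubernetes ServiceAccount:

gcloud iam service-accounts get-iam-policy \
  gmp-test-sa@<project>.iam.gserviceaccount.com \
&&
kubectl -n otel get serviceaccount default -o yaml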

2. Configure Otel

The GCP docs start with the config, which is fine too,
but I suggest just applying the deployment and config straight away.
Below are brief notes on the config and the parts I modified.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel:prometheus-test
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel:prometheus-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel:prometheus-test
subjects:
- kind: ServiceAccount
  namespace: otel
  name: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: otel
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
      app.kubernetes.io/name: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
        app.kubernetes.io/name: otel-collector
      annotations:
        prometheus.io/port: "8888"
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.120.0
        ports:
        - name: metrics
          containerPort: 8888
          protocol: TCP
        - name: otlp
          containerPort: 4317
          protocol: TCP
        - name: otlp-http
          containerPort: 4318
          protocol: TCP
        args:
        - --config
        - /etc/otel/config.yaml
        - --feature-gates=exporter.googlemanagedprometheus.intToDouble
        volumeMounts:
        - mountPath: /etc/otel/
          name: otel-config
      volumes:
      - name: otel-config
        configMap:
          name: otel-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: otel
spec:
  selector:
    app: otel-collector
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
      name: metrics
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: otel-config
  namespace: otel
data:
  config.yaml: |-
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel_k8s'
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
              action: keep
              regex: (prom-example|kube-state-metrics|otel-collector)
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: (.+):(?:\d+);(\d+)
              replacement: $$1:$$2
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)

    processors:
      resourcedetection:
        detectors: [gcp]
        timeout: 10s

      transform:
        # "location", "cluster", "namespace", "job", "instance", and "project_id" are reserved, and
        # metrics containing these labels will be rejected.  Prefix them with exported_ to prevent this.
        metric_statements:
        - context: datapoint
          statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")

      batch:
        # batch metrics before sending to reduce API usage
        send_batch_max_size: 200
        send_batch_size: 200
        timeout: 5s

      memory_limiter:
        # drop metrics if memory usage gets too high
        check_interval: 1s
        limit_percentage: 65
        spike_limit_percentage: 20

    # Note that the googlemanagedprometheus exporter block is intentionally blank
    exporters:
      googlemanagedprometheus:

    service:
      # To find the reason for scrape failures, enable debug logging:
      # telemetry:
      #     logs:
      #         level: DEBUG
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch, memory_limiter, resourcedetection, transform]
          exporters: [googlemanagedprometheus]
    
      telemetry:
        metrics:
          level: detailed
          readers:
            - pull:
                exporter:
                  prometheus:
                    host: "0.0.0.0"
                    port: 8888 
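
Assuming everything above is saved to a single file, which I'll call otel-collector.yaml here (the name is my own choice), deploying it is just:

kubectl apply -f otel-collector.yaml \
&&
kubectl -n otel rollout status deploy/otel-collector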

a. Deployment settings

The annotation prometheus.io/port: "8888"
is there because we want to scrape the otel collector's own metrics, so the metrics port is specified explicitly;
otherwise Prometheus defaults to scraping the first port, 4317, which does not serve metrics.

b. Config settings

The relabel_configs rule on the source label __meta_kubernetes_pod_label_app_kubernetes_io_name
is where the scrape targets are selected; kube-state-metrics has to be installed separately (see the sketch below).
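
For kube-state-metrics, one way to install it (my assumption; any method works as long as the pods end up with the app.kubernetes.io/name=kube-state-metrics label) is the community Helm chart:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts \
&&
helm install kube-state-metrics prometheus-community/kube-state-metrics -n otel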

prom-example is the Prometheus example app; deploy it with this command:

kubectl -n otel apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.13.0/examples/example-app.yaml

Enable collector logging (for debugging)

service:
  telemetry:
    logs:
      level: DEBUG

ref. Failed to scrape Prometheus endpoint
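
As far as I know the collector doesn't hot-reload a mounted ConfigMap, so after changing the config I re-apply and restart (same file name as assumed above):

kubectl -n otel apply -f otel-collector.yaml \
&&
kubectl -n otel rollout restart deploy/otel-collector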

Expose otel's own metrics on port 8888 so they can be scraped.

      telemetry:
        metrics:
          level: detailed
          readers:
            - pull:
                exporter:
                  prometheus:
                    host: "0.0.0.0"
                    port: 8888 
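
A quick local check (not part of the pipeline) that the 8888 endpoint is actually serving metrics:

# in one terminal
kubectl -n otel port-forward deploy/otel-collector 8888:8888
# in another terminal
curl -s localhost:8888/metrics | head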

The basic structure of an otel config is as follows.

Setting             Purpose
receivers           Receive data (Prometheus, OTLP, KubeletStats, etc.)
processors          Process data (batching, transformation, filtering)
exporters           Send data to different backends (Prometheus, Loki, OTLP)
service.pipelines   Define the data flow (which data goes to which destination)
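
As a minimal illustration of how those four blocks hook together, here is a generic skeleton (my own example, not the config used above):

receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  googlemanagedprometheus:
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [googlemanagedprometheus]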

P.S. The Otel demo YAML is attached; it has some basic settings that can be used as a reference.

3. Verify the Otel data

Confirm that the otel-collector shows no errors.
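
I check it with something like:

kubectl -n otel get pods \
&&
kubectl -n otel logs deploy/otel-collector --tail=50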

Entering up in Cloud Metrics Explorer showed no target cluster.
273-fig.1.png

Open the cluster settings and turn on the Cloud Monitoring option, with only SYSTEM selected.
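
The console toggle corresponds roughly to this gcloud command (a sketch; <cluster> and <region> are placeholders):

gcloud container clusters update <cluster> \
  --region <region> \
  --monitoring=SYSTEM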

After that, the corresponding metrics show up.
273-fig.2.png
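
Besides Metrics Explorer, the same data can also be checked through GMP's Prometheus-compatible query API (endpoint format per the GMP docs; up is just an example query):

curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  'https://monitoring.googleapis.com/v1/projects/<project>/location/global/prometheus/api/v1/query?query=up'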

4. Viewing GMP with Grafana

This isn't strictly part of the Otel workflow,
but at the end of the day, the data consolidated through Otel still ends up being consumed in Grafana.
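
I won't go through the wiring in detail, but roughly: Grafana talks to GMP through its Prometheus-compatible endpoint. A minimal provisioning sketch (assuming authentication is handled separately, e.g. by Google's data source syncer that keeps an OAuth token refreshed; <project> and the data source name are placeholders):

apiVersion: 1
datasources:
  - name: Google Managed Prometheus   # placeholder name
    type: prometheus
    access: proxy
    url: https://monitoring.googleapis.com/v1/projects/<project>/location/global/prometheus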

The Otel docs have a diagram explaining the Grafana data flow; the dashboard itself can be downloaded from
grafana dashboard.

As of 2025/03/20, using that dashboard, many of the charts were broken out of the box;
tracking it down, the cause is that many metric names have since been renamed.

The metric names I swapped in are ones I guessed at by searching through the otel metrics;
I don't know whether they're correct, but they look about right.

273-fig.3.png

Aside

While bouncing questions off GPT,
I also came across another architecture:
when there are too many receivers, run an Otel collector as a DaemonSet,
funnel everything into one central Otel collector, and let that collector decide which exporter to send to.
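
A rough sketch of that chaining, under my own assumptions (the otel-gateway Service name is a placeholder and the configs are trimmed to the relevant parts): each node-level collector forwards everything over OTLP, and only the central collector decides on the final exporter.

# agent (DaemonSet) collector: receives data locally and forwards it over OTLP
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  otlp:
    endpoint: otel-gateway.otel.svc.cluster.local:4317  # placeholder gateway Service
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

# central (gateway) collector: receives from the agents and picks the exporter
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  googlemanagedprometheus:
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [googlemanagedprometheus]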